Lung cancer accounts for more deaths than any other cancer in both men and women, about 28% of all cancer deaths. The prognosis for lung cancer is poor. The Cancer Genome Atlas (TCGA) has samples of the most common type of lung cancer called non-small cell lung cancers. Specifically, the subtypes being studied are called lung adenocarcinoma and lung squamous cell carcinoma. Here we analyze the expression profiles of those patients, accessible in the form of a raw RNA-seq counts using a pipeline based on the R/Bioconductor software package Rsubread.
We import the raw data from the SummarizedExperiment. It contains only one assay, in a matrix-like object with the genes in the row and the samples in the columns. The data is the counts for gene and sample.
library(SummarizedExperiment)
se <- readRDS(file.path("rawCounts", "seLUSC.rds"))
se
class: RangedSummarizedExperiment
dim: 20115 553
metadata(5): experimentData annotation cancerTypeCode
cancerTypeDescription objectCreationDate
assays(1): counts
rownames(20115): 1 2 ... 102724473 103091865
rowData names(3): symbol txlen txgc
colnames(553): TCGA.18.3406.01A.01R.0980.07
TCGA.18.3407.01A.01R.0980.07 ... TCGA.90.7767.11A.01R.2125.07
TCGA.92.7340.11A.01R.2045.07
colData names(549): type bcr_patient_uuid ...
lymph_nodes_aortic_pos_by_ihc lymph_nodes_aortic_pos_total
The raw table of counts contains RNA-seq data from 20115 genes and 553 samples.
We want to do a paired experiment. For this reason, from the 553 samples, we will keep those that share patient identification, which is in the factor bcr_patient_barcode.
We select those patient identification that are in normal tissue and in tumor tissue.
normal <- data.frame(colData(se)[colData(se)$type == 'normal',])$bcr_patient_barcode
tumor <- data.frame(colData(se)[colData(se)$type == 'tumor',])$bcr_patient_barcode
length(normal)
[1] 51
length(tumor)
[1] 502
We have 51 samples with normal tissue and 502 samples with tumoral tissue. Let’s see which of the normal samples share patient identification with tumor samples.
common_bcr_patient_barcode <- normal[normal %in% tumor]
length(common_bcr_patient_barcode)
[1] 47
47 patient identification are in normal samples and tumor samples. This means that for those patients, there were extraction for normal tissue and for tumoral tissue.
Now, we filter the SummarizedExperiment object to keep only these patients.
paired_seLUSC <- se[,colData(se)$bcr_patient_barcode %in% common_bcr_patient_barcode]
paired_seLUSC
class: RangedSummarizedExperiment
dim: 20115 94
metadata(5): experimentData annotation cancerTypeCode
cancerTypeDescription objectCreationDate
assays(1): counts
rownames(20115): 1 2 ... 102724473 103091865
rowData names(3): symbol txlen txgc
colnames(94): TCGA.22.4593.01A.21R.1820.07
TCGA.22.4609.01A.21R.2125.07 ... TCGA.90.7767.11A.01R.2125.07
TCGA.92.7340.11A.01R.2045.07
colData names(549): type bcr_patient_uuid ...
lymph_nodes_aortic_pos_by_ihc lymph_nodes_aortic_pos_total
Now, there are 20115 genes, the same as before because we have not filtered genes, and 94 samples, which are 47 normal samples and 47 tumoral samples, with the same patient identificacion.
length(unique(paired_seLUSC$bcr_patient_barcode))
[1] 47
length(levels(paired_seLUSC$bcr_patient_barcode))
[1] 495
Analyzing the length of the array that contains the patient identification and its levels, we see that, although we have discarded the patients that did not pass the filter of the paired design, the levels are still with all the patient identifications, 495.
When doing differential expression analysis, the levels are taken to build the design matrix. For this reason, as the leves are still the one in the raw data, we have to take out the levels that are not in the array of patient identifications.
paired_seLUSC$bcr_patient_barcode <- droplevels(paired_seLUSC$bcr_patient_barcode)
length(unique(paired_seLUSC$bcr_patient_barcode))
[1] 47
length(levels(paired_seLUSC$bcr_patient_barcode))
[1] 47
Now we have 47 patient identifications and also 47 levels. Now we can save this filtered SummarizedExperiment in an RDS file.
saveRDS(paired_seLUSC, file.path("results", "paired_seLUSC.rds"))
We start importing the table of counts from the SummarizedExperiment container that is already paired for the previous filtering process. The structure is the same: the rows represent genes and the columns represent samples.
library(SummarizedExperiment)
paired_se <- readRDS(file.path("results", "paired_seLUSC.rds"))
paired_se
class: RangedSummarizedExperiment
dim: 20115 94
metadata(5): experimentData annotation cancerTypeCode
cancerTypeDescription objectCreationDate
assays(1): counts
rownames(20115): 1 2 ... 102724473 103091865
rowData names(3): symbol txlen txgc
colnames(94): TCGA.22.4593.01A.21R.1820.07
TCGA.22.4609.01A.21R.2125.07 ... TCGA.90.7767.11A.01R.2125.07
TCGA.92.7340.11A.01R.2045.07
colData names(549): type bcr_patient_uuid ...
lymph_nodes_aortic_pos_by_ihc lymph_nodes_aortic_pos_total
The row table of counts contains RNA-seq data from 20115 genes and 94 samples, which come from 47 patients.
We explore the column, that corresponds to samples. It contains the phenotypic data, which in this case corresponds to clinical variables, and their corresponding metadata.
dim(colData(paired_se))
[1] 94 549
colData(paired_se)[1:5, 1:5]
DataFrame with 5 rows and 5 columns
type bcr_patient_uuid
<factor> <factor>
TCGA.22.4593.01A.21R.1820.07 tumor 8fbe1f9e-f2a6-4550-aabf-b06607b821f0
TCGA.22.4609.01A.21R.2125.07 tumor 47cbb242-a356-43ba-95c3-0b0c9ef90fbf
TCGA.22.5471.01A.01R.1635.07 tumor bece6b8e-5d6c-4dd6-85a3-9b3a9c670aa7
TCGA.22.5472.01A.01R.1635.07 tumor bd3bf142-7c14-4538-8a76-3c6e140fa01a
TCGA.22.5478.01A.01R.1635.07 tumor 4daf4a91-bc36-40c8-8fca-ea61b6706775
bcr_patient_barcode form_completion_date
<factor> <factor>
TCGA.22.4593.01A.21R.1820.07 TCGA-22-4593 2011-8-5
TCGA.22.4609.01A.21R.2125.07 TCGA-22-4609 2012-1-16
TCGA.22.5471.01A.01R.1635.07 TCGA-22-5471 2011-5-5
TCGA.22.5472.01A.01R.1635.07 TCGA-22-5472 2011-4-29
TCGA.22.5478.01A.01R.1635.07 TCGA-22-5478 2011-5-5
prospective_collection
<factor>
TCGA.22.4593.01A.21R.1820.07 NO
TCGA.22.4609.01A.21R.2125.07 NO
TCGA.22.5471.01A.01R.1635.07 NO
TCGA.22.5472.01A.01R.1635.07 NO
TCGA.22.5478.01A.01R.1635.07 NO
mcols(colData(paired_se), use.names=TRUE)
DataFrame with 549 rows and 2 columns
labelDescription
<character>
type sample type (tumor/normal)
bcr_patient_uuid bcr patient uuid
bcr_patient_barcode bcr patient barcode
form_completion_date form completion date
prospective_collection tissue prospective collection indicator
... ...
lymph_nodes_pelvic_pos_total total pelv lnp
lymph_nodes_aortic_examined_count total aor lnr
lymph_nodes_aortic_pos_by_he aln pos light micro
lymph_nodes_aortic_pos_by_ihc aln pos ihc
lymph_nodes_aortic_pos_total total aor-lnp
CDEID
<character>
type NA
bcr_patient_uuid NA
bcr_patient_barcode 2673794
form_completion_date NA
prospective_collection 3088492
... ...
lymph_nodes_pelvic_pos_total 3151828
lymph_nodes_aortic_examined_count 3104460
lymph_nodes_aortic_pos_by_he 3151832
lymph_nodes_aortic_pos_by_ihc 3151831
lymph_nodes_aortic_pos_total 3151827
These metadata consists of two columns of information about the clinical variables. One called labelDescription contains a succint description of the variable, often not more self-explanatory than the variable name itself, and the other called ‘CDEID’ corresponds to the Common Data Element (CDE) identifier. This identifier can be useed to search for further information about the associated clinical variable.
Now, we explore the row (feature) data.
rowData(paired_se)
DataFrame with 20115 rows and 3 columns
symbol txlen txgc
<character> <integer> <numeric>
1 A1BG 3322 0.5644190
2 A2M 4844 0.4882329
3 NAT1 2280 0.3942982
4 NAT2 1322 0.3895613
5 SERPINA3 3067 0.5249429
... ... ... ...
20111 POTEB 1706 0.4308324
20112 SNORD124 104 0.4903846
20113 SNORD121B 81 0.4074074
20114 GAGE10 538 0.5055762
20115 BRWD1-IT2 1028 0.5924125
rowRanges(paired_se)
GRanges object with 20115 ranges and 3 metadata columns:
seqnames ranges strand | symbol txlen
<Rle> <IRanges> <Rle> | <character> <integer>
1 chr19 [58345178, 58362751] - | A1BG 3322
2 chr12 [ 9067664, 9116229] - | A2M 4844
9 chr8 [18170477, 18223689] + | NAT1 2280
10 chr8 [18391245, 18401218] + | NAT2 1322
12 chr14 [94592058, 94624646] + | SERPINA3 3067
... ... ... ... . ... ...
100996331 chr15 [20835372, 21877298] - | POTEB 1706
101340251 chr17 [40027542, 40027645] - | SNORD124 104
101340252 chr9 [33934296, 33934376] - | SNORD121B 81
102724473 chrX [49303669, 49319844] + | GAGE10 538
103091865 chr21 [39313935, 39314962] + | BRWD1-IT2 1028
txgc
<numeric>
1 0.5644190
2 0.4882329
9 0.3942982
10 0.3895613
12 0.5249429
... ...
100996331 0.4308324
101340251 0.4903846
101340252 0.4074074
102724473 0.5055762
103091865 0.5924125
-------
seqinfo: 455 sequences (1 circular) from hg38 genome
We are going to compare normal samples with tumor samples. Exploring the metadata, we can see that we have 47 normal samples and 47 tumor samples. As they are paired, the number must be the same.
table(paired_se$type)
normal tumor
47 47
Normalitzation consists in the adjustment for sample and gene specific factors, to make gene expression values comparable across samples. This process is really important due to the fact that the samples may have been sequenced at different depth and that there may be sample specific biases related to technical differences in samples extarction, preparation and sequencing.
The normalitzation is done in two steps: * Within-sample: adjustments to compare across features in a sample. Scaling: using counts per million reads (CPM) mapped to the genome. Between-sample: adjustments to compare a feature across samples. Sample-specific normalization factors: using the TMM algorithm from the R/Bioconductor package edgeR. Quantile normalization: using the CQN algorithm from the R/Bioconductor package cqn or in the limma-voom pipeline.
To perform quality assessment and normalization we need first to load the edgeR R/Bioconductor package and create a DGEList object.
library(edgeR)
paired_dge <- DGEList(counts=assays(paired_se)$counts, genes=mcols(paired_se))
Warning in as.data.frame(x, row.names = NULL, optional = optional, ...):
Arguments in '...' ignored
names(paired_dge)
[1] "counts" "samples" "genes"
saveRDS(paired_dge, file.path("results", "paired_dge.rds"))
Once the ‘DGEList’ object is created, we can performe the scaling to CPM values. Therefore, \(\log_2\) CPM values of expression are calculated and used as an additional assay element to ease their manipulation. \(\log_2\) CPM units separate better high and low expression, than raw counts or non-logged CPM units.
assays(paired_se)$logCPM <- cpm(paired_dge, log=TRUE, prior.count=0.5)
assays(paired_se)$logCPM[1:5, 1:5]
TCGA.22.4593.01A.21R.1820.07 TCGA.22.4609.01A.21R.2125.07
1 2.433074 1.760274
2 9.013463 10.810625
9 -6.857108 -6.857108
10 -6.857108 -6.857108
12 5.605901 5.647349
TCGA.22.5471.01A.01R.1635.07 TCGA.22.5472.01A.01R.1635.07
1 2.122585 0.7750831
2 7.713748 11.1007352
9 -6.857108 -6.8571079
10 -6.857108 -6.8571079
12 4.823934 5.8569327
TCGA.22.5478.01A.01R.1635.07
1 3.713818
2 9.448173
9 -6.857108
10 -6.857108
12 4.920808
Let’s examine the sequencing depth in terms of total number of sequence read counts mapped to the genome per sample. Figure 1 below shows the sequencing depth per sample, also known as library sizes, in increasing order.
Figure 1: Library sizes in increasing order
There is the same number of tumor samples and normal samples and they seem to be randonly distributed. However, there is still high variability in the library size as can be observed in Figure 1.
Let’s look at the distribution of expression values per sample in terms of logarithmic CPM units. Due to the large number of samples, we display tumor and normal samples separately, and are shown in Figure 2
Figure 2: Non-parametric density distribution of expression profiles per sample
In Figure 2 we do not appreciate substantial differences between the samples in the distribution of expression values as.
Let’s calculate now the average expression per gene through all the samples. Figure 3 shows the distribution of those values across genes.
Figure 3: Distribution of average expression level per gene
RNA-seq expression profiles from lowly-expressed genes can lead to artifacts in downstream differential expression analyses. For this reason, it is common practice to remove them following some criteria, such as: filter out genes below a minimum average CPM (or log2 CPM) value throughout the samples or filter out genes with fewer than a given number of samples meeting a minimum CPM (or log2 CPM) cutoff. This graphic shows what would be the minimum average CPM (red line).
We can also make an MA-plot to see biases due to expression. First, we define the groups that we want to compare. In our case, we use the sample type to define groups, by modifying the DGEList object as follows:
paired_dge$samples$group <- paired_se$type
table(paired_dge$samples$group)
normal tumor
47 47
Here we generate the MA-plot using type sample as grouping factor.
Figure 4: MA-plot using type grouping
Figure 4 shows us the need of remove the lower experssed genes to normalize the samples.
In the light of this plot, we may consider a cutoff of 1 log CPM unit as minimum value of expression to select genes being expressed across samples. Using this cutoff we proceed to filter out lowly-expressed genes.
mask <- avgexp > 1
dim(paired_se)
[1] 20115 94
paired_se.filt <- paired_se[mask, ]
dim(paired_se.filt)
[1] 11866 94
paired_dge.filt <- paired_dge[mask, ]
dim(paired_dge.filt)
[1] 11866 94
After this filtering process, we end up with 11866 genes.
Store un-normalized versions of the filtered expression data.
saveRDS(paired_se.filt, file.path("results", "paired_se.filt.unnorm.rds"))
saveRDS(paired_dge.filt, file.path("results", "paired_dge.filt.unnorm.rds"))
We can also use a second approach to filter data using the CPM cutoff value of expression. We will keep only genes that have this minimum level of expression in at least as many samples as the smallest group of comparison. We are still comparing sample type.
cpmcutoff <- round(10/min(paired_dge$sample$lib.size/1e+06), digits = 1)
cpmcutoff
[1] 0.3
nsamplescutoff <- min(table(paired_se$type))
nsamplescutoff
[1] 47
After knowing these parameters we can proceed to mask the lower-expressed genes.
mask2 <- rowSums(cpm(paired_dge) > cpmcutoff) >= nsamplescutoff
paired_se.filt2 <- paired_se[mask2, ]
paired_dge.filt2 <- paired_dge[mask2, ]
dim(paired_se)
[1] 20115 94
dim(paired_se.filt)
[1] 11866 94
dim(paired_se.filt2)
[1] 14200 94
After this second kind of filtering, we end up with 14200. As we can see, the first filter is more stringent thant the second one.
Store un-normalized versions of the filtered expression data.
saveRDS(paired_se.filt2, file.path("results", "paired_se.filt2.unnorm.rds"))
saveRDS(paired_dge.filt2, file.path("results", "paired_dge.filt2.unnorm.rds"))
We will compare both approaches done before in order to compare them and see which one discards more genes and is more restrictive.
Figure 5: Distribution of average expression level per gene and filtering comparative
After comparing the different approaches used in the filter of low expressed genes, we have decided to contunue working with the first set of filtered genes because this is more restrictive. We can see in Figure 5 that the second approach is more conservative with our dataset and even genes with negative logCPM passes through this filter.
We calculate now the normalization factors on the filtered expression data set.
paired_dge.filt <- calcNormFactors(paired_dge.filt)
Replace the raw log2 CPM units in the corresponding assay element of the SummarizedExperiment object, by the normalized ones.
assays(paired_se.filt)$logCPM <- cpm(paired_dge.filt, log=TRUE, normalized.lib.sizes=TRUE, prior.count=0.25)
Store normalized versions of the filtered expression data.
saveRDS(paired_se.filt, file.path("results", "paired_se.filt.rds"))
saveRDS(paired_dge.filt, file.path("results", "paired_dge.filt.rds"))
We examine now the MA-plots of the normalized expression profiles. We look first to the tumor samples in Figure 7.
First, we define the groups that we want to compare. In our case, we use the sample type to define groups, by modifying the DGEList object as follows:
paired_dge$samples$group <- paired_se$type
table(paired_dge$samples$group)
normal tumor
47 47
paired_dge.filt$samples$group <- paired_se$type
Here we generate the MA-plot using type sample as grouping factor.
Figure 6: MA-plot using type grouping after filtering by low expression
In Figure 6 we don’t see the artifact caused by the discreteness of counts at low values, where ratios between low numbers may easy lead to large fold-changes. In general, fold-changes from large expression values are more reliable than those coming from low-expression values. We don’t see it because we have filtered by low-expression previously. Red line, which indicates us the tendency of the dots, shows that there is no bias, since it falls in the blue line. At the end it seems to decrease a little, but as we said, the fold-changes with a high average logCPM is more reliable. The problem would be if the biased was at the beginning.
Figure 7: MA-plots of the tumor samples
We do not observe samples with major expression-level dependent biases. Although some individual samples may have little bias, in general this is corrected. Let’s look now to the normal samples in Figure 8.
Figure 8: MA-plots of the normal samples
Figure 8: MA-plots of the normal samples
We do not observe neither important expression-level dependent biases among the normal samples in Figure 8. There is even less bias than in tumor samples.
We will search now for potential surrogate of batch effect indicators. Given that each sample names corresponds to a TCGA barcode (see https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode), following the strategy described in http://bioinformatics.mdanderson.org/main/TCGABatchEffects:Overview we are going to derive different elements of the TCGA barcode and examine their distribution across samples.
tss <- substr(colnames(paired_se.filt), 6, 7)
table(tss)
tss
22 33 34 39 43 51 56 58 60 77 85 90 92
20 4 4 2 16 6 16 2 2 14 2 4 2
center <- substr(colnames(paired_se.filt), 27, 28)
table(center)
center
07
94
plate <- substr(colnames(paired_se.filt), 22, 25)
table(plate)
plate
0980 1100 1635 1758 1820 1858 1949 2045 2125 2187 2296 2326 A28V
1 3 11 4 14 1 8 20 20 4 4 2 2
portionanalyte <- substr(colnames(paired_se.filt), 18, 20)
table(portionanalyte)
portionanalyte
01R 11R 21R 31R 41R
55 27 8 2 2
samplevial <- substr(colnames(paired_se.filt), 14, 16)
table(samplevial)
samplevial
01A 11A
47 47
From this information we can make the following observations:
All samples were sequenced at the same center
Samples belong to two vials.
Samples were collected across different tissue source sites (TSS).
Samples were sequenced within differnt plates. There are 13 different plates.
Samples were sequenced using different portions and analyte combinations. There are 5 different conditions described.
We are going to use the TSS as surrogate of batch effect indicator. Considering our outcome of interest as molecular changes between sample types, tumor vs. normal, we will examine now the cross-classification of this outcome with TSS.
table(data.frame(TYPE=paired_se.filt$type, TSS=tss))
TSS
TYPE 22 33 34 39 43 51 56 58 60 77 85 90 92
normal 10 2 2 1 8 3 8 1 1 7 1 2 1
tumor 10 2 2 1 8 3 8 1 1 7 1 2 1
We observe that we have the same TSS in normal and tumor samples, so the TSS is not a source of variability, since they are all the same.
We can also examine the other three parameters that can lead to variability due to technical issues. We are not interested in this variability and can be a source of confussion.
table(data.frame(TYPE=paired_se.filt$type, PORTIONALYTE=portionanalyte))
PORTIONALYTE
TYPE 01R 11R 21R 31R 41R
normal 44 3 0 0 0
tumor 11 24 8 2 2
table(data.frame(TYPE=paired_se.filt$type, PLATE=plate))
PLATE
TYPE 0980 1100 1635 1758 1820 1858 1949 2045 2125 2187 2296 2326 A28V
normal 0 0 5 4 7 1 4 10 10 2 2 1 1
tumor 1 3 6 0 7 0 4 10 10 2 2 1 1
table(data.frame(TYPE=paired_se.filt$type, SAMPLEVIAL=samplevial))
SAMPLEVIAL
TYPE 01A 11A
normal 0 47
tumor 47 0
All normal samples belong to the same vial and all tumor samples belong to the same vial. As it is remarked in the documentation, this is the correct sample vial for tumoral tissue and for normal tissue. So it does not provoque any variability. Moreover, they are all from the same aliquot (A).
The plate is almost equal between normal and tumor.
For these reasons, the variables that may lead into batch effects are not leading to it, as commented before.
To ilustrate how samples are grouped together by hierarchical clustering and multidimensional scaling, we draw the Dendrogram plot (Figure 9) and the MDS plot (Figure 10). We calculate again log CPM values with a higher prior count to moderate extreme fold-changes produced by low counts.
Figure 9: Hierarchical clustering of the samples
We can observe that samples cluster primarily by sample type: tumor or normal.
In Figure 10 we show the corresponding MDS plot. Here we see more clearly that tumor and normal samples are separated. We can also observe that one tumor samples, corresponding to individual 8623 is in the normal cluster and also happens in the MDS plot. A closer examination of their corresponding MA-plots also reveals a slight dependence of expression changes on average expression. We may consider discarding this sample and doing the MDS plot again to have a closer look to the differences among the rest of the samples.
Figure 10: Multidimensional scaling plot of the samples
For the former case, assume that from the previous MDS plot we could decide discard that mentioned sample, which is identified with the number 8623. But we will not do it because we have problems with the matrix size after removing this sample.
As we have not eliminate batch effects, because we have decided that there is no significant since the samples are paired and most of the technical variables are compensated or the same, we will keep working with the SummarizedExperiment object from filtered genes.
Here is seen how the sample disapears after it is removed, then all the normal samples and tumor samples cluster together with any exception (Figure 11 and Figure 12.
maskbad <- colnames(paired_se.filt) %in% colnames(paired_se.filt)[substr(colnames(paired_se.filt), 9, 12) == "8623"]
dim(paired_se.filt)
[1] 11866 94
dim(paired_dge.filt)
[1] 11866 94
paired_se.filt <- paired_se.filt[, !maskbad]
paired_dge.filt <- paired_dge.filt[, !maskbad]
dim(paired_se.filt)
[1] 11866 92
dim(paired_dge.filt)
[1] 11866 92
Figure 11: Hierarchical clustering of the samples
Figure 12: Multidimensional scaling plot of the samples
saveRDS(paired_se.filt, file.path("results", "paired_se.filt2.rds"))
saveRDS(paired_dge.filt, file.path("results", "paired_dge.filt2.rds"))
We perform differentual expression analysis using limma pipeline, combined with the surrogate variable analysis using the R/Bioconductor package sva.
First, we do the SVA to adjust for surrogated variables. In the design matrix is also included the patient identifications.
library(sva)
mod <- model.matrix(~type + bcr_patient_barcode, data = colData(se.filt))
mod0 <- model.matrix(~1, data = colData(se.filt))
sv <- sva(assays(se.filt)$logCPM, mod = mod, mod0 = mod0)
Number of significant surrogate variables is: 11
Iteration (out of 5 ):1 2 3 4 5
sv$n
[1] 11
There are 11 surrogate variables. We combine the model matrix with the surrogate variables.
mod <- cbind(mod, sv$sv)
colnames(mod) <- c(colnames(mod)[1:(nlevels(se.filt$bcr_patient_barcode)+1)], paste0("SV", 1:sv$n))
Now we perform limma-voom method to adjust for mean-variance relationship. We use lmFit function to calculate the linear model and eBayes function to calculate the moderated t-statistic. Finally, decideTests function will classify into upregulated, downregulated or non-significant.
v <- voom(dge.filt, mod, plot=TRUE)
Figure 13: Voom plot
fit <- lmFit(v, mod)
fit <- eBayes(fit)
FDRcutoff <- 0.1
res <- decideTests(fit, p.value = FDRcutoff)
summary(res)
(Intercept) typetumor bcr_patient_barcodeTCGA-22-4609
Down 2 0 4
NotSig 473 11866 11861
Up 11391 0 1
bcr_patient_barcodeTCGA-22-5471 bcr_patient_barcodeTCGA-22-5472
Down 2 1
NotSig 11862 11863
Up 2 2
bcr_patient_barcodeTCGA-22-5478 bcr_patient_barcodeTCGA-22-5481
Down 0 6
NotSig 11865 11858
Up 1 2
bcr_patient_barcodeTCGA-22-5482 bcr_patient_barcodeTCGA-22-5483
Down 11 2
NotSig 11852 11863
Up 3 1
bcr_patient_barcodeTCGA-22-5489 bcr_patient_barcodeTCGA-22-5491
Down 3 4
NotSig 11860 11860
Up 3 2
bcr_patient_barcodeTCGA-33-4587 bcr_patient_barcodeTCGA-33-6737
Down 6 5
NotSig 11859 11858
Up 1 3
bcr_patient_barcodeTCGA-34-7107 bcr_patient_barcodeTCGA-34-8454
Down 1 18
NotSig 11864 11836
Up 1 12
bcr_patient_barcodeTCGA-39-5040 bcr_patient_barcodeTCGA-43-3394
Down 4 5
NotSig 11860 11856
Up 2 5
bcr_patient_barcodeTCGA-43-5670 bcr_patient_barcodeTCGA-43-6143
Down 1 2
NotSig 11863 11862
Up 2 2
bcr_patient_barcodeTCGA-43-6647 bcr_patient_barcodeTCGA-43-6771
Down 9 4
NotSig 11855 11860
Up 2 2
bcr_patient_barcodeTCGA-43-6773 bcr_patient_barcodeTCGA-43-7657
Down 8 13
NotSig 11832 11848
Up 26 5
bcr_patient_barcodeTCGA-43-7658 bcr_patient_barcodeTCGA-51-4079
Down 8 22
NotSig 11857 11834
Up 1 10
bcr_patient_barcodeTCGA-51-4080 bcr_patient_barcodeTCGA-51-4081
Down 2 3
NotSig 11862 11861
Up 2 2
bcr_patient_barcodeTCGA-56-7222 bcr_patient_barcodeTCGA-56-7579
Down 2 3
NotSig 11855 11860
Up 9 3
bcr_patient_barcodeTCGA-56-7580 bcr_patient_barcodeTCGA-56-7582
Down 20 9
NotSig 11810 11839
Up 36 18
bcr_patient_barcodeTCGA-56-7730 bcr_patient_barcodeTCGA-56-7731
Down 1 7
NotSig 11862 11856
Up 3 3
bcr_patient_barcodeTCGA-56-8309 bcr_patient_barcodeTCGA-56-8623
Down 2 4
NotSig 11858 11861
Up 6 1
bcr_patient_barcodeTCGA-58-8386 bcr_patient_barcodeTCGA-60-2709
Down 1 3
NotSig 11864 11861
Up 1 2
bcr_patient_barcodeTCGA-77-7138 bcr_patient_barcodeTCGA-77-7142
Down 5 17
NotSig 11858 11843
Up 3 6
bcr_patient_barcodeTCGA-77-7335 bcr_patient_barcodeTCGA-77-7337
Down 9 1
NotSig 11852 11862
Up 5 3
bcr_patient_barcodeTCGA-77-7338 bcr_patient_barcodeTCGA-77-8007
Down 5 9
NotSig 11857 11856
Up 4 1
bcr_patient_barcodeTCGA-77-8008 bcr_patient_barcodeTCGA-85-7710
Down 28 88
NotSig 11825 11671
Up 13 107
bcr_patient_barcodeTCGA-90-6837 bcr_patient_barcodeTCGA-90-7767
Down 7 13
NotSig 11852 11846
Up 7 7
bcr_patient_barcodeTCGA-92-7340 SV1 SV2 SV3 SV4 SV5 SV6
Down 13 1606 2150 2551 1744 1900 1184
NotSig 11847 8219 7875 7238 8646 8313 9463
Up 6 2041 1841 2077 1476 1653 1219
SV7 SV8 SV9 SV10 SV11
Down 747 648 895 303 384
NotSig 10396 10482 10209 11399 11300
Up 723 736 762 164 182
There is no differentially expressed gene. All genes are non-significant for the type variable, which is our variable of interest.
We can build a table to classify and sort the genes by p-value.
genesmd <- data.frame(chr = as.character(seqnames(rowRanges(se.filt))), symbol = rowData(se.filt)[, 1], stringsAsFactors = FALSE)
fit$genes <- genesmd
tt <- topTable(fit, coef = 2, n = Inf)
head(tt, n = 10)
chr symbol logFC AveExpr t P.Value adj.P.Val
3737 chr1 KCNA2 1.6261737 7.016262 2.772653 0.0084270730 0.9997256
6670 chr2 SP3 -2.2126834 6.867519 -2.642088 0.0117319514 0.9997256
7334 chr12 UBE2N -4.0600224 2.192373 -4.235944 0.0001311969 0.9997256
56999 chr3 ADAMTS9 -2.5188837 4.158329 -3.013702 0.0044805536 0.9997256
23409 chr12 SIRT4 -0.7626027 5.447291 -2.400657 0.0211469578 0.9997256
54465 chr2 ETAA1 0.9422784 6.564915 2.386613 0.0218635472 0.9997256
339967 chr4 TMPRSS11A 3.5046682 6.821845 2.505872 0.0164193440 0.9997256
5782 chr7 PTPN12 1.3602872 8.734349 2.342342 0.0242683012 0.9997256
2219 chr9 FCN1 1.6653676 5.230000 2.669790 0.0109441892 0.9997256
80335 chr3 WDR82 1.0508389 5.966129 2.363655 0.0230823491 0.9997256
B
3737 -4.405096
6670 -4.420213
7334 -4.441569
56999 -4.442297
23409 -4.451648
54465 -4.453891
339967 -4.454001
5782 -4.456015
2219 -4.461125
80335 -4.462878
As we can see in the table, the adjusted p-values are close to 1. For this reason, they are far from being significant.
Under the null-hypothesis, the distribution of raw p-values must be uniform. We plot a histogram with the p-values and a QQ-plot (Figure 14).
Both plots show that the disribution is far from being uniform. This may be because some variability not explained neither by the biological factor nor by the surrogate variables. More quality assessments could be done in order to correct for this biases.
Figure 14: p-values and QQ-plot
As there is no differentially expressed gene, we cannot do any classical functional enrichment, since it is necessary to have a list of differentially expressed genes.
geneUniverse <- rownames(se.filt)
length(geneUniverse)
[1] 11866
In this case a method for pathway analysis that addresses this shortcoming by assessing differential expression directly at gene set level is used. Therfore, small but consistent changes occurring for a number of genes operating in a common pathway will be found. To perform this it is calculated for each gene set an enrichment score (ES) as function of the changes in gene expression by the genes forming the gene set.
The used gene set collection has been downloaded from GSEA. It is configured of gene sets that represent signatures of cellular pathways which are often disregulated in cancer.
GeneSetCollection
names: GLI1_UP.V1_DN, GLI1_UP.V1_UP, ..., LEF1_UP.V1_UP (189 total)
unique identifiers: 22818, 143384, ..., 79649 (11250 total)
types in collection:
geneIdType: EntrezIdentifier (1 total)
collectionType: NullCollection (1 total)
[1] 189
[1] "GLI1_UP.V1_DN" "GLI1_UP.V1_UP" "E2F1_UP.V1_DN" "E2F1_UP.V1_UP"
[5] "EGFR_UP.V1_DN" "EGFR_UP.V1_UP"
There are 189 gene sets in this gene sets collection.
First we map the identifiers from the gene sets to the identifiers of the data we are going to analyze.
gsc <- mapIdentifiers(entrezOncogens, AnnoOrEntrezIdentifier(metadata(se.filt)$annotation))
gsc
GeneSetCollection
names: GLI1_UP.V1_DN, GLI1_UP.V1_UP, ..., LEF1_UP.V1_UP (189 total)
unique identifiers: 22818, 143384, ..., 79649 (11250 total)
types in collection:
geneIdType: EntrezIdentifier (1 total)
collectionType: NullCollection (1 total)
in this case, nothing has happend, we could jump this step because data is already anchorated to Entrez identifiers. Now, we start with an incidence matrix indicating what genes belong to what gene set.
Im <- incidence(gsc)
dim(Im)
[1] 189 11250
Im[1:2, 1:10]
22818 143384 140711 57583 81669 54432 79712 23596 91543 6580
GLI1_UP.V1_DN 1 1 1 1 1 1 1 1 1 1
GLI1_UP.V1_UP 0 0 0 0 0 0 0 0 0 0
The incidence matrix is a matrix in which the rows represent the gene sets, the columns represent the gene in entrez identifier, and the data is 1 or 0, depending whether the gene is in the gene set or it is not.
Next, we discard genes (columns in the incidence matrix) that do not form part of our data.
Im <- Im[, colnames(Im) %in% rownames(se.filt)]
dim(Im)
[1] 189 6067
From 11250 of the genes in the gene sets, only 6067 are in our experiment. The rest may have been filtered during the quality assessment process.
As not all the genes in our experiment are present in the gene sets, we discard them.
se.filt <- se.filt[colnames(Im), ]
dim(se.filt)
[1] 6067 94
dge.filt <- dge.filt[colnames(Im), ]
dim(dge.filt)
[1] 6067 94
Now we perform limma pipeline combined with SVA to create a table of genes and their t-statistic to do the GSEA. The pipeline is the same as in differential expression analysis, but now with less genes (we have filtered them)
library(limma)
library(sva)
mod <- model.matrix(~se.filt$type + bcr_patient_barcode, data = colData(se.filt))
mod0 <- model.matrix(~1, data = colData(se.filt))
sv <- sva(assays(se.filt)$logCPM, mod = mod, mod0 = mod0)
Number of significant surrogate variables is: 11
Iteration (out of 5 ):1 2 3 4 5
mod <- cbind(mod, sv$sv)
colnames(mod) <- c(colnames(mod)[1:48], paste0("SV", 1:sv$n))
v <- voom(dge.filt, mod)
fit <- lmFit(assays(se.filt)$logCPM, mod)
fit <- eBayes(fit, trend = TRUE)
tt <- topTable(fit, coef = 2, n = Inf)
Now we can calculate the z-score, that gives us more robustness about analyzing the moderated t-statistic. We select only those gene sets with more than 5 genes and sort the z-score to see the most significant gene sets.
Im <- Im[rowSums(Im) >= 5, ]
dim(Im)
[1] 189 6067
tGSgenes <- tt[match(colnames(Im), rownames(tt)), "t"]
length(tGSgenes)
[1] 6067
head(tGSgenes)
[1] -0.1480484 -0.6165308 -0.3180018 0.9846855 -1.3489488 0.2229815
zS <- sqrt(rowSums(Im)) * (as.vector(Im %*% tGSgenes)/rowSums(Im))
length(zS)
[1] 189
head(zS)
GLI1_UP.V1_DN GLI1_UP.V1_UP E2F1_UP.V1_DN E2F1_UP.V1_UP EGFR_UP.V1_DN
0.0813339 0.5123863 1.7608323 0.7978159 2.3626425
EGFR_UP.V1_UP
1.1886709
rnkGS <- sort(abs(zS), decreasing = TRUE)
head(rnkGS)
ERB2_UP.V1_DN MEK_UP.V1_DN CAHOY_OLIGODENDROCUTIC
3.040329 2.712168 2.507218
WNT_UP.V1_DN KRAS.600_UP.V1_UP EGFR_UP.V1_DN
2.397208 2.376572 2.362642
A QQ plot (Figure 15) may show visually how the z-scores are distributed.
Figure 15: QQ-plot
This QQ plot shows that the z-scores do not follow a normal distribution. Under the null hypothesis, those z-scores representing gene sets that are not enriched follow a normal distribution, but this does not happens. Maybe because overlapping gene sets or because as this gene set is specific for cancer and we are analyzing cancer cells, the pathways are all significant.
Let’s adjust for the p-value to fins differential expressed gene sets.
pv <- pmin(pnorm(zS), 1 - pnorm(zS))
pvadj <- p.adjust(pv, method = "fdr")
DEgs <- names(pvadj)[which(pvadj < 0.01)]
length(DEgs)
[1] 0
There is no gene set enriched.
GSVA is another way to analyze the gene sets without a list of differentially expressed genes. This has some properties that can be useful when differential expression analysis and GSEA do not work.
library(GSVA)
GSexpr <- gsva(assays(se.filt)$logCPM, gsc, min.sz=5, max.sz=300, verbose=FALSE)
dim(GSexpr)
[1] 189 94
We create an expression data matrix, in which the rows are gene sets, the columns are samples and the data is the ES. We have 189 gene sets and 94 samples. We perform SVA with this matrix and perform a differential expression pipeline, but instead of doing it with genes, we do it with gene sets.
mod <- model.matrix(~se.filt$type + bcr_patient_barcode, data = colData(se.filt))
mod0 <- model.matrix(~1, data = colData(se.filt))
svaobj <- sva(GSexpr, mod, mod0)
Number of significant surrogate variables is: 7
Iteration (out of 5 ):1 2 3 4 5
modSVs <- cbind(mod, svaobj$sv)
fit <- lmFit(GSexpr, modSVs)
fit <- eBayes(fit)
tt <- topTable(fit, coef = 2, n = Inf)
DEgs <- rownames(tt[tt$adj.P.Val < 0.01, , drop = FALSE])
length(DEgs)
[1] 105
At the end, we have 105 gene sets differentially expressed. The most significant ones can be seen at the top table tt.
head(tt)
logFC AveExpr t P.Value
JNK_DN.V1_DN -0.2591644 0.02798997 -14.64205 1.946661e-19
HINATA_NFKB_MATRIX -0.6320939 0.05210125 -13.05664 1.680177e-17
ATF2_S_UP.V1_DN -0.2084399 0.02713102 -12.67754 5.110700e-17
BCAT_GDS748_DN -0.4665696 0.04177321 -12.24179 1.877812e-16
ALK_DN.V1_DN 0.2372378 -0.01974942 11.72500 9.071760e-16
RELA_DN.V1_DN -0.1993592 0.01855533 -10.84520 1.434217e-14
adj.P.Val B
JNK_DN.V1_DN 3.679190e-17 34.05338
HINATA_NFKB_MATRIX 1.587768e-15 29.65628
ATF2_S_UP.V1_DN 3.219741e-15 28.55640
BCAT_GDS748_DN 8.872660e-15 27.26864
ALK_DN.V1_DN 3.429125e-14 25.70860
RELA_DN.V1_DN 4.517782e-13 22.97125
Although the log fold-change is not large, they are very significant.
A volcano plot can help us to visualize the results. In figure 16 It is seen to be a lot of gene sets differentially expressed as well.
Figure 16: Volcano plot
(#fig:voom2, )Voom plot
(Intercept) typetumor bcr_patient_barcodeTCGA-22-4609
Down 19 4402 0
NotSig 550 2147 11865
Up 11297 5317 1
bcr_patient_barcodeTCGA-22-5471 bcr_patient_barcodeTCGA-22-5472
Down 1 1
NotSig 11860 11863
Up 5 2
bcr_patient_barcodeTCGA-22-5478 bcr_patient_barcodeTCGA-22-5481
Down 1 8
NotSig 11864 11854
Up 1 4
bcr_patient_barcodeTCGA-22-5482 bcr_patient_barcodeTCGA-22-5483
Down 1 1
NotSig 11864 11864
Up 1 1
bcr_patient_barcodeTCGA-22-5489 bcr_patient_barcodeTCGA-22-5491
Down 0 1
NotSig 11866 11863
Up 0 2
bcr_patient_barcodeTCGA-33-4587 bcr_patient_barcodeTCGA-33-6737
Down 49 2
NotSig 11727 11859
Up 90 5
bcr_patient_barcodeTCGA-34-7107 bcr_patient_barcodeTCGA-34-8454
Down 0 10
NotSig 11865 11853
Up 1 3
bcr_patient_barcodeTCGA-39-5040 bcr_patient_barcodeTCGA-43-3394
Down 38 1
NotSig 11757 11863
Up 71 2
bcr_patient_barcodeTCGA-43-5670 bcr_patient_barcodeTCGA-43-6143
Down 3 0
NotSig 11859 11865
Up 4 1
bcr_patient_barcodeTCGA-43-6647 bcr_patient_barcodeTCGA-43-6771
Down 6 0
NotSig 11859 11864
Up 1 2
bcr_patient_barcodeTCGA-43-6773 bcr_patient_barcodeTCGA-43-7657
Down 4 10
NotSig 11851 11851
Up 11 5
bcr_patient_barcodeTCGA-43-7658 bcr_patient_barcodeTCGA-51-4079
Down 53 5
NotSig 11736 11858
Up 77 3
bcr_patient_barcodeTCGA-51-4080 bcr_patient_barcodeTCGA-51-4081
Down 1 3
NotSig 11864 11853
Up 1 10
bcr_patient_barcodeTCGA-56-7222 bcr_patient_barcodeTCGA-56-7579
Down 1 0
NotSig 11859 11863
Up 6 3
bcr_patient_barcodeTCGA-56-7580 bcr_patient_barcodeTCGA-56-7582
Down 3 3
NotSig 11862 11860
Up 1 3
bcr_patient_barcodeTCGA-56-7730 bcr_patient_barcodeTCGA-56-7731
Down 24 7
NotSig 11817 11856
Up 25 3
bcr_patient_barcodeTCGA-56-8309 bcr_patient_barcodeTCGA-56-8623
Down 4 349
NotSig 11853 11191
Up 9 326
bcr_patient_barcodeTCGA-58-8386 bcr_patient_barcodeTCGA-60-2709
Down 0 1
NotSig 11864 11864
Up 2 1
bcr_patient_barcodeTCGA-77-7138 bcr_patient_barcodeTCGA-77-7142
Down 7 126
NotSig 11857 11701
Up 2 39
bcr_patient_barcodeTCGA-77-7335 bcr_patient_barcodeTCGA-77-7337
Down 74 7
NotSig 11756 11854
Up 36 5
bcr_patient_barcodeTCGA-77-7338 bcr_patient_barcodeTCGA-77-8007
Down 2 5
NotSig 11861 11860
Up 3 1
bcr_patient_barcodeTCGA-77-8008 bcr_patient_barcodeTCGA-85-7710
Down 6 30
NotSig 11858 11787
Up 2 49
bcr_patient_barcodeTCGA-90-6837 bcr_patient_barcodeTCGA-90-7767
Down 0 7
NotSig 11866 11858
Up 0 1
bcr_patient_barcodeTCGA-92-7340
Down 10
NotSig 11850
Up 6
Table of results:
genesmd <- data.frame(chr = as.character(seqnames(rowRanges(se.filt))), symbol = rowData(se.filt)[, 1], stringsAsFactors = FALSE)
fit$genes <- genesmd
tt <- topTable(fit, coef = 2, n = Inf)
head(tt, n = 10)
chr symbol logFC AveExpr t P.Value adj.P.Val
692205 chr2 SNORD89 4.558016 4.061933 29.03018 5.211431e-33 6.183884e-29
9263 chr7 STK17A 5.510503 7.131354 28.35326 1.590799e-32 9.438213e-29
171482 chrX FAM9A 6.615355 1.406490 28.08485 2.492491e-32 9.858632e-29
121391 chr12 KRT74 3.723965 3.686091 26.15339 7.085033e-31 2.019640e-27
6100 chr7 RP9 3.665029 4.745683 26.05106 8.510198e-31 2.019640e-27
94134 chr10 ARHGAP12 3.110953 3.727910 25.90667 1.103364e-30 2.182085e-27
27316 chrX RBMX 4.123045 3.581820 25.37018 2.927936e-30 4.963270e-27
81556 chr15 VWA9 4.322738 2.008536 25.09129 4.897154e-30 7.263703e-27
10523 chr19 CHERP 2.953520 3.510743 24.88294 7.214682e-30 9.512158e-27
93129 chr16 ORAI3 4.155809 3.147168 24.79035 8.577857e-30 1.017848e-26
B
692205 65.01160
9263 64.02774
171482 63.31797
121391 60.18915
6100 60.05726
94134 59.76904
27316 58.77257
81556 58.16998
10523 57.90230
93129 57.68415
As can be seen looking at the p-values, there are significant differential expressed genes.
DE genes:
DEgenes <- rownames(tt)[tt$adj.P.Val < FDRcutoff]
length(DEgenes)
[1] 9719
saveRDS(DEgenes, file.path("results", "DEgenes.rds"))
There are 9719.
Figure 17: Top 10 MA plot
In the volcano plot, Figure 17, is seen the top 10 genes, all of them with a large magnitude fold-change and high statistical significance.
Both plots in Figure 18 show that the disribution is far from being uniform. This may be because some variability not explained by the biological factor. Moreover, surrogate variables are not analysed in this case. More quality assessments could be done in order to correct for this biases.
Figure 18: p-values
We will start with a Gene Onthology analysis of the DE genes. The first step is safe the identifiers of the of the original set of gene profiled. Then we will follow the main steps of the GO analysis. This steps are: 1) build a parameter object with information specifying the gene universe, the set of DE genes, the annotation package to use, and so one, 2) run the functional enrichment analysis, and 3) store and visualize the results.
geneUniverse <- rownames(se.filt)
length(geneUniverse)
[1] 11866
library(GOstats)
params <- new("GOHyperGParams", geneIds=DEgenes, universeGeneIds=geneUniverse,
annotation="org.Hs.eg.db", ontology="BP",
pvalueCutoff=0.05, testDirection="over")
Since a problem in a GO analysis is that the hierarchy of GO terms and their overlap render highly dependent tests. A conditional test is directly used. Then, if parent and child GO term contain the same significant genes, the child node will be retrieved because it is more specific. The no-conditional test is also done to allow the comparison between the two test results.
hgOver <- hyperGTest(params)
hgOver
Gene to GO BP test for over-representation
13342 GO BP ids tested (136 have p < 0.05)
Selected gene set size: 7794
Gene universe size: 9460
Annotation package: org.Hs.eg
conditional(params) <- TRUE
hgOverCond <- hyperGTest(params)
hgOverCond
Gene to GO BP Conditional test for over-representation
13342 GO BP ids tested (85 have p < 0.05)
Selected gene set size: 7794
Gene universe size: 9460
Annotation package: org.Hs.eg
As spected we have less significant GO terms when conditional test is used than when we perform a non-conditional test. The number of significant GO terms have chenged from 136 to 85.
This is the data.frame object with the results:
goresults <- summary(hgOverCond)
head(goresults)
GOBPID Pvalue OddsRatio ExpCount Count Size
1 GO:0045844 0.001653831 Inf 27.18837 33 33
2 GO:0051153 0.004324410 Inf 23.06004 28 28
3 GO:0055013 0.005309072 Inf 22.24503 27 27
4 GO:0043038 0.007830366 Inf 20.59725 25 25
5 GO:0010501 0.008055666 7.510633 29.66004 35 36
6 GO:0007507 0.008726619 1.588209 205.87924 220 250
Term
1 positive regulation of striated muscle tissue development
2 regulation of striated muscle cell differentiation
3 cardiac muscle cell development
4 amino acid activation
5 RNA secondary structure unwinding
6 heart development
GO terms involving a few genes (e.g., < 3) in their size (m) and in their enrichment by DE genes (k) are likely to be less reliable than those that involve many genes. Likewise, large GO terms may provide little insight. To try to spot the more interesting and reliable GO terms we can filter the previous results by a minimum value on the Count and Size columns, a maximum Count value, and order them by the OddsRatio column:
goresults <- goresults[goresults$Size >= 3 & goresults$Size <= 300 & goresults$Count >= 3, ]
goresults <- goresults[order(goresults$OddsRatio, decreasing=TRUE), ]
head(goresults)
GOBPID Pvalue OddsRatio ExpCount Count Size
1 GO:0045844 0.001653831 Inf 27.18837 33 33
2 GO:0051153 0.004324410 Inf 23.06004 28 28
3 GO:0055013 0.005309072 Inf 22.24503 27 27
4 GO:0043038 0.007830366 Inf 20.59725 25 25
9 GO:0006418 0.011547980 Inf 18.94947 23 23
12 GO:0010591 0.014023396 Inf 18.12558 22 22
Term
1 positive regulation of striated muscle tissue development
2 regulation of striated muscle cell differentiation
3 cardiac muscle cell development
4 amino acid activation
9 tRNA aminoacylation for protein translation
12 regulation of lamellipodium assembly
In this case a method for pathway analysis that addresses this shortcoming by assessing DE directly at gene set level is used. Therfore, small but consistent changes occurring for a number of genes operating in a common pathway will be found. To perform this it is calculated for each gene set an enrichment score (ES) as function of the changes in gene expression by the genes forming the gene set.
The used gene set collection has been downloaded from GSEA. It is configured of gene sets that represent signatures of cellular pathways which are often disregulated in cancer.
GeneSetCollection
names: GLI1_UP.V1_DN, GLI1_UP.V1_UP, ..., LEF1_UP.V1_UP (189 total)
unique identifiers: 22818, 143384, ..., 79649 (11250 total)
types in collection:
geneIdType: EntrezIdentifier (1 total)
collectionType: NullCollection (1 total)
[1] 189
[1] "GLI1_UP.V1_DN" "GLI1_UP.V1_UP" "E2F1_UP.V1_DN" "E2F1_UP.V1_UP"
[5] "EGFR_UP.V1_DN" "EGFR_UP.V1_UP"
First we need to map the identifiers from the gene sets to the identifiers of the data we are going to analyze:
gsc <- mapIdentifiers(entrezOncogens, AnnoOrEntrezIdentifier(metadata(se.filt)$annotation))
gsc
GeneSetCollection
names: GLI1_UP.V1_DN, GLI1_UP.V1_UP, ..., LEF1_UP.V1_UP (189 total)
unique identifiers: 22818, 143384, ..., 79649 (11250 total)
types in collection:
geneIdType: EntrezIdentifier (1 total)
collectionType: NullCollection (1 total)
Nothing has happend, we can jump this step because data is already anchorated to Entrez ientifiers. Then, we have to start with an incidence matrix indicating what genes belong to what gene set:
Im <- incidence(gsc)
dim(Im)
[1] 189 11250
Im[1:2, 1:10]
22818 143384 140711 57583 81669 54432 79712 23596 91543 6580
GLI1_UP.V1_DN 1 1 1 1 1 1 1 1 1 1
GLI1_UP.V1_UP 0 0 0 0 0 0 0 0 0 0
Next, we discard genes (columns in Im) that do not form part of our data:
Im <- Im[, colnames(Im) %in% rownames(se.filt)]
dim(Im)
[1] 189 6067
Likewise, not all genes in our data are annotated to gene sets, so we also discard them:
se.filt <- se.filt[colnames(Im), ]
dim(se.filt)
[1] 6067 94
dge.filt <- dge.filt[colnames(Im), ]
dim(dge.filt)
[1] 6067 94
Figure 19: Voom plot
[1] 189 6067
[1] 6067
[1] 5.547354 7.512204 5.549996 -8.950190 -7.892440 7.462655
[1] 189
GLI1_UP.V1_DN GLI1_UP.V1_UP E2F1_UP.V1_DN E2F1_UP.V1_UP EGFR_UP.V1_DN
2.918573 -2.710980 -6.409755 -3.131997 1.162975
EGFR_UP.V1_UP
-2.551358
JNK_DN.V1_DN ATF2_S_UP.V1_DN SNF5_DN.V1_UP ALK_DN.V1_DN
22.04500 20.28736 19.57452 18.89280
BCAT_GDS748_DN KRAS.DF.V1_DN
18.48136 17.94397
library(GSVA)
GSexpr <- gsva(assays(se.filt)$logCPM, gsc, min.sz=5, max.sz=300, verbose=FALSE)
class(GSexpr)
[1] "matrix"
dim(GSexpr)
[1] 189 94
mod <- model.matrix(~se.filt$type + bcr_patient_barcode, data = colData(se.filt))
fit <- lmFit(GSexpr, mod)
fit <- eBayes(fit)
tt <- topTable(fit, coef = 2, n = Inf)
DEgs <- rownames(tt[tt$adj.P.Val < 0.01, , drop = FALSE])
length(DEgs)
[1] 108
Figure 20: GSVA volcano plot
sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=ca_ES.UTF-8 LC_NUMERIC=C
[3] LC_TIME=ca_ES.UTF-8 LC_COLLATE=ca_ES.UTF-8
[5] LC_MONETARY=ca_ES.UTF-8 LC_MESSAGES=ca_ES.UTF-8
[7] LC_PAPER=ca_ES.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=ca_ES.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 methods stats graphics grDevices utils
[8] datasets base
other attached packages:
[1] GO.db_3.5.0 GOstats_2.44.0
[3] Category_2.44.0 Matrix_1.2-14
[5] GSVA_1.26.0 org.Hs.eg.db_3.5.0
[7] GSEABase_1.40.1 graph_1.56.0
[9] sva_3.26.0 BiocParallel_1.12.0
[11] genefilter_1.60.0 mgcv_1.8-23
[13] nlme_3.1-137 geneplotter_1.56.0
[15] annotate_1.56.2 XML_3.98-1.11
[17] AnnotationDbi_1.40.0 lattice_0.20-35
[19] edgeR_3.20.9 limma_3.34.9
[21] SummarizedExperiment_1.8.1 DelayedArray_0.4.1
[23] matrixStats_0.53.1 Biobase_2.38.0
[25] GenomicRanges_1.30.3 GenomeInfoDb_1.14.0
[27] IRanges_2.12.0 S4Vectors_0.16.0
[29] BiocGenerics_0.24.0 knitr_1.20
[31] BiocStyle_2.6.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 locfit_1.5-9.1 rprojroot_1.3-2
[4] digest_0.6.15 mime_0.5 R6_2.2.2
[7] backports_1.1.2 RSQLite_2.1.0 evaluate_0.10.1
[10] highr_0.6 zlibbioc_1.24.0 Rgraphviz_2.22.0
[13] blob_1.1.1 rmarkdown_1.9 shinythemes_1.1.1
[16] splines_3.4.4 stringr_1.3.1 RCurl_1.95-4.10
[19] bit_1.1-14 shiny_1.1.0 compiler_3.4.4
[22] httpuv_1.4.3 xfun_0.1 pkgconfig_2.0.1
[25] htmltools_0.3.6 GenomeInfoDbData_1.0.0 bookdown_0.7
[28] codetools_0.2-15 AnnotationForge_1.20.0 later_0.7.2
[31] bitops_1.0-6 grid_3.4.4 RBGL_1.54.0
[34] xtable_1.8-2 DBI_1.0.0 magrittr_1.5
[37] stringi_1.2.2 XVector_0.18.0 promises_1.0.1
[40] RColorBrewer_1.1-2 tools_3.4.4 bit64_0.9-7
[43] survival_2.42-3 yaml_2.1.19 memoise_1.1.0